A spoken term detection framework for recovering out-of-vocabulary words using the web
Authors
Abstract
Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information-rich terms (often named entities), and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into the system output, recovering up to 40% of OOVs and reducing system error.
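The recovery step described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the OOV region has already been decoded into a phone sequence and that candidate words with pronunciations have already been gathered from the Web using the surrounding context. The abstract does not specify how candidates are matched against the region; the phone-level edit distance used here is a stand-in, and the names `phone_edit_distance`, `rank_candidates`, and `max_norm_dist` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def phone_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_candidates(
    region_phones: Sequence[str],
    candidates: Dict[str, List[str]],
    max_norm_dist: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[Tuple[str, float]]:
    """Rank candidate words by normalized phone distance to the OOV region.

    Candidates whose normalized distance exceeds max_norm_dist are dropped.
    """
    scored = []
    for word, phones in candidates.items():
        dist = phone_edit_distance(region_phones, phones)
        norm = dist / max(len(region_phones), len(phones), 1)
        if norm <= max_norm_dist:
            scored.append((word, norm))
    return sorted(scored, key=lambda item: item[1])


# Hypothetical inputs: phones decoded in the OOV region, plus candidate
# words (with ARPAbet-style pronunciations) assumed to come from a
# context-driven Web search.
region = ["OW", "B", "AA", "M", "AH"]
web_candidates = {
    "Obama": ["OW", "B", "AA", "M", "AH"],
    "Alabama": ["AE", "L", "AH", "B", "AE", "M", "AH"],
    "senate": ["S", "EH", "N", "AH", "T"],
}
ranked = rank_candidates(region, web_candidates)
```

In this toy setup only the candidate that closely matches the decoded phones survives the threshold. A real system would presumably search lattices or phone confusion networks rather than a single phone string, and would weight substitutions by acoustic confusability.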
Similar resources
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
The Effect of Using Word Clouds on EFL Students’ Long-Term Vocabulary Retention
Vocabulary is an important component of all four language skills. The issue of vocabulary retention is of great importance to EFL teachers in instructional contexts because they always ...
Word-based probabilistic phonetic retrieval for low-resource spoken term detection
Two problems make Spoken Term Detection (STD) particularly challenging under low-resource conditions: the low quality of speech recognition hypotheses, and a high number of out-of-vocabulary (OOV) words. In this paper, we propose an intuitive way to handle OOV terms for STD on word-based Confusion Networks using phonetic similarities, and generalize it into a probabilistic and vocabulary-indepen...
Hybrid word-subword spoken term detection
The thesis investigates keyword spotting and spoken term detection (STD), which are considered subsets of spoken document retrieval. It deals with two-phase approaches in which speech is first processed by a speech recognizer, and the search for queries is then performed in the output of this recognizer. A standard large vocabulary continuous speech recognizer (LVCSR) with a fixed vocabulary is not c...
Recurrent Out-of-Vocabulary Word Detection Using Distribution of Features
The repeated use of out-of-vocabulary (OOV) words in a spoken document seriously degrades a speech recognizer’s performance. This paper provides a novel method for accurately detecting such recurrent OOV words. Standard OOV word detection methods classify each word segment into in-vocabulary (IV) or OOV. This word-by-word classification tends to be affected by sudden vocal irregularities in spo...